Traditionally, reinforcement learning has operated on "tabular" state spaces, e.g. "State 1", "State 2", "State 3" etc. However, many important and interesting reinforcement learning problems (like moving robot arms or playing Atari games) are based on either continuous or very high-dimensional state spaces (like robot joint angles or pixels). Deep neural networks constitute one method for learning a value function or policy from continuous and high-dimensional observations.
In this miniproject, you will teach an agent to play the Lunar Lander game from OpenAI Gym. The agent needs to learn how to land a lunar module safely on the surface of the moon (at coordinate [0,0]). The state space is 8-dimensional and (mostly) continuous, consisting of the X and Y coordinates of the lander, the X and Y velocity, the angle of the lander, the angular velocity, and two booleans indicating whether the left and right leg of the lander have landed on the moon.
The agent gets a reward of +100 for landing safely and -100 for crashing. In addition, it receives "shaping" rewards at every step: positive rewards for moving closer to [0,0], decreasing its velocity, shifting to an upright angle and touching the lander legs to the moon; negative rewards for moving away from the landing site, increasing its velocity, turning sideways, lifting the lander legs off the moon and for using fuel (firing the thrusters). The largest reward it can receive on a single step is about ±100. The best score an agent can achieve in an episode is about +250.
There are two versions of the task: one with discrete controls and one with continuous controls. In the discrete version, the agent can take one of four actions at each time step: [do nothing, fire engines left, fire engines right, fire engines down]. In the continuous version, the agent sets two continuous actions at each time step: the amount of engine thrust and the direction.
We will use Policy Gradient approaches to learn the task. In the previous miniprojects, the network generated a probability distribution over the outputs and was trained to maximize the probability of a specific target output given an observation. In Policy Gradient methods, the network generates a probability distribution over actions, and is trained to maximize expected future rewards given an observation.
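Concretely, REINFORCE performs gradient ascent on the expected return using the standard policy-gradient identity

$$\nabla_\theta J(\theta) \;=\; \mathbb{E}_{\pi_\theta}\!\left[\,\sum_t G_t \,\nabla_\theta \log \pi_\theta(a_t \mid s_t)\right],$$

where $G_t$ is the discounted return from step $t$: actions that led to high returns are made more probable.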
Reinforcement learning is noisy! Normally one should average over multiple random seeds with the same parameters to really see the impact of a change to the model, but we won't do this due to time constraints. However, you should be able to see learning over time with every approach. If you don't see any improvement, or very unstable learning, double-check your model and try adjusting the learning rate.
You may sometimes see "AssertionError: IsLocked() = False" after restarting your code. To fix this, reinitialize the environments by running the Gym Setup code below.
You will not be marked on the episode movies. If your notebook file is large, delete the movies before uploading.
The miniproject is marked out of 15, with a further mark breakdown in each question:
We may perform random tests of your code but will not rerun the whole notebook.
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
return false;
}
Before you start, please enter your sciper number(s) in the field below; they are used to seed the random number generators. The variable student_2 may remain empty if you work alone.
sciper = {'student_1': 292070,
} #'student_2': 217033}
seed = sciper['student_1']#+sciper['student_2']
import gym
import numpy as np
import matplotlib.pyplot as plt
import logging
from matplotlib.animation import FuncAnimation
from IPython.display import HTML, clear_output
from gym.envs.box2d.lunar_lander import heuristic
import keras
import tensorflow as tf
from tensorflow.contrib.distributions import Beta
from keras.models import Sequential, Model, model_from_json, load_model
from keras.layers import Dense, Lambda, Input, Dropout
from keras.optimizers import Adam
from keras import backend as K
import time, datetime
import dill
np.random.seed(seed)
tf.set_random_seed(seed*2)
import sys
import resource
print (resource.getrlimit(resource.RLIMIT_STACK))
print (sys.getrecursionlimit())
max_rec = 0x200000
# May segfault without this line. 0x100 is a guess at the size of each stack frame.
resource.setrlimit(resource.RLIMIT_STACK, [0x100 * max_rec, resource.RLIM_INFINITY])
sys.setrecursionlimit(max_rec)
Here we load the Reinforcement Learning environments from Gym (both the continuous and discrete versions).
We limit each episode to 500 steps so that we can train faster.
gym.logger.setLevel(logging.ERROR)
discrete_env = gym.make('LunarLander-v2')
discrete_env._max_episode_steps = 500
discrete_env.seed(seed*3)
continuous_env = gym.make('LunarLanderContinuous-v2')
continuous_env._max_episode_steps = 500
continuous_env.seed(seed*4)
gym.logger.setLevel(logging.WARN)
%matplotlib inline
plt.rcParams['figure.figsize'] = 12, 8
plt.rcParams["animation.html"] = "jshtml"
We include a function that lets you visualize an "episode" (i.e. a series of observations resulting from the actions that the agent took in the environment).
We will also use the "Results" class (a wrapper around a python dictionary) to store, save, load and plot your results. You can save your results to disk with results.save('filename') and reload them with Results(filename='filename'). Use results.pop(experiment_name) to delete an old experiment.
def AddValue(output_size, value):
return Lambda(lambda x: x + value, output_shape=(output_size,))
def render(episode, env):
fig = plt.figure()
img = plt.imshow(env.render(mode='rgb_array'))
plt.axis('off')
def animate(i):
img.set_data(episode[i])
return img,
anim = FuncAnimation(fig, animate, frames=len(episode), interval=24, blit=True)
html = HTML(anim.to_jshtml())
plt.close(fig)
!rm None0000000.png
return html
class Results(dict):
def __init__(self, *args, **kwargs):
if 'filename' in kwargs:
data = np.load(kwargs['filename'])
super().__init__(data)
else:
super().__init__(*args, **kwargs)
self.new_key = None
self.plot_keys = None
self.ylim = None
def __setitem__(self, key, value):
super().__setitem__(key, value)
self.new_key = key
def plot(self, window):
clear_output(wait=True)
for key in self:
#Ensure latest results are plotted on top
if self.plot_keys is not None and key not in self.plot_keys:
continue
elif key == self.new_key:
continue
self.plot_smooth(key, window)
if self.new_key is not None:
self.plot_smooth(self.new_key, window)
plt.xlabel('Episode')
plt.ylabel('Reward')
plt.legend(loc='lower right')
if self.ylim is not None:
plt.ylim(self.ylim)
plt.show()
def plot_smooth(self, key, window):
if len(self[key]) == 0:
plt.plot([], [], label=key)
return None
y = np.convolve(self[key], np.ones((window,))/window, mode='valid')
x = np.linspace(window/2, len(self[key]) - window/2, len(y))
plt.plot(x, y, label=key)
def save(self, filename='results'):
np.savez(filename, **self)
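For reference, the smoothing in plot_smooth is a plain moving average computed with np.convolve; a minimal standalone sketch of the shapes involved:

```python
import numpy as np

rewards = np.arange(10, dtype=float)
window = 4
# 'valid' mode keeps only full windows: len(rewards) - window + 1 points
smoothed = np.convolve(rewards, np.ones(window) / window, mode='valid')
```

The first smoothed point is the mean of the first `window` rewards, which is why plot_smooth shifts its x-axis by window/2.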
# my utilities
def save_model(model, name):
model.save("obj/" + name + ".h5")
def load_my_model(name):
model = keras.models.load_model("obj/" + name + ".h5")
return model
def get_agent_reward(name):
model_policy = load_my_model(name+ '_policy')
try:
model_baseline = load_my_model(name+ '_baseline')
except OSError:
with open("obj/" + name + ".pkl", 'rb') as f:
recurrent_agent = dill.load(f)
recurrent_agent.update_model(model_policy)
rewards = dill.load(f)
return recurrent_agent, rewards
else:
with open("obj/" + name + ".pkl", 'rb') as f:
recurrent_agent = dill.load(f)
recurrent_agent.update_model(model_policy, model_baseline)
rewards = dill.load(f)
return recurrent_agent, rewards
def save_agent_reward(recurrent_agent, rewards, name):
with open("obj/" + name + ".pkl", 'wb') as f:
dill.dump(recurrent_agent, f)
dill.dump(rewards, f)
save_model(recurrent_agent.model_policy, name + '_policy')
print('model baseline:', recurrent_agent.model_baseline)
if recurrent_agent.model_baseline:
print('Saving model baseline!')
save_model(recurrent_agent.model_baseline, name + '_baseline')
To get an idea of how the environment works, we'll plot an episode resulting from random actions at each point in time, and a "perfect" episode using a specially-designed function to land safely within the yellow flags.
Remove these plots before submitting the miniproject, to reduce the file size.
def run_fixed_episode(env, policy):
frames = []
observation = env.reset()
done = False
while not done:
frames.append(env.render(mode='rgb_array'))
action = policy(env, observation)
observation, reward, done, info = env.step(action)
return frames
def random_policy(env, observation):
return env.action_space.sample()
def heuristic_policy(env, observation):
return heuristic(env.unwrapped, observation)
episode = run_fixed_episode(discrete_env, random_policy)
render(episode, discrete_env)
episode = run_fixed_episode(discrete_env, heuristic_policy)
render(episode, discrete_env)
This is the method we will call to set up an experiment. Reinforcement learning usually operates on an Observe-Decide-Act cycle, as you can see below.
You don't need to add anything here; you will be working directly on the RL agent.
num_episodes = 30
def run_experiment(RLAgent_es, experiment_name, env, num_episodes, learning_rate=0.001, baseline=None, old_params=None, graph=True):
rewards = []
startin_reward_len = 0
#Initiate the learning agent
if old_params:
agent = old_params[0]
rewards = old_params[1]
startin_reward_len = len(rewards)
else:
agent = RLAgent_es(n_obs = env.observation_space.shape[0], action_space = env.action_space,
learning_rate = learning_rate, discount=0.9, baseline = baseline)
all_episode_frames = []
step = 0
for episode in range(1, num_episodes+1):
#Update results plot and occasionally store an episode movie
episode_frames = None
if episode % 10 == 0:
results[experiment_name] = np.array(rewards)
if graph:
results.plot(10)
if episode % 500 == 0 or episode == num_episodes:
episode_frames = []
#Reset the environment to a new episode
observation = env.reset()
episode_reward = 0
while True: # in every episode there is a full trip to the moon
if episode_frames is not None:
episode_frames.append(env.render(mode='rgb_array'))
# 1. Decide on an action based on the observations
action = agent.decide(observation)
# 2. Take action in the environment
next_observation, reward, done, info = env.step(action)
episode_reward += reward
# 3. Store the information returned from the environment for training
agent.observe(observation, action, reward)
# 4. When we reach a terminal state ("done"), use the observed episode to train the network
if done:
rewards.append(episode_reward)
                if not graph:
                    print('episode number:', episode + startin_reward_len, 'reward:', episode_reward)
if episode_frames is not None:
all_episode_frames.append(episode_frames)
agent.train() # in this way I'm training every episode, at the end of the episode!
break
# Reset for next step
observation = next_observation
step += 1
return all_episode_frames, agent, rewards
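Stripped of plotting and bookkeeping, the cycle in run_experiment reduces to the skeleton below. DummyEnv is a made-up stand-in that mirrors the Gym reset/step API, so this runs without any environment installed:

```python
class DummyEnv:
    """Stand-in for a Gym env: 8 steps of reward +1, then the episode ends."""
    def reset(self):
        self.t = 0
        return 0.0                                # initial observation
    def step(self, action):
        self.t += 1
        done = self.t >= 8
        return 0.0, 1.0, done, {}                 # observation, reward, done, info

def run_episode(env, decide):
    observation, done, total = env.reset(), False, 0.0
    while not done:
        action = decide(observation)                     # 1. decide on an action
        observation, reward, done, _ = env.step(action)  # 2. take it in the environment
        total += reward                                  # 3. observe the outcome
    return total
```

A policy here is just any callable from observation to action, e.g. `run_episode(DummyEnv(), lambda obs: 0)`.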
Here we give the outline of a python class that will represent the reinforcement learning agent (along with its decision-making network). We'll modify this class to add additional methods and functionality throughout the course of the miniproject.
NOTE: We have set up this class to implement new functionality as we go along using keyword arguments. If you prefer, you can instead subclass RLAgent for each question.
class RLAgent(object):
def __init__(self, n_obs, action_space, learning_rate, discount, baseline = None):
#We need the state and action dimensions to build the network
self.n_obs = n_obs
#We'll treat the continuous case a bit differently
self.continuous = 'Discrete' not in str(action_space)
if self.continuous:
self.n_act = action_space.shape[0]
self.act_low = action_space.low
self.act_range = action_space.high - action_space.low
else:
self.n_act = action_space.n
self.lr = learning_rate
self.gamma = discount
self.moving_baseline = None
self.use_baseline = False
self.use_adaptive_baseline = False
if baseline == 'adaptive':
self.use_baseline = True
self.use_adaptive_baseline = True
elif baseline == 'simple':
self.use_baseline = True
        #These lists store the cumulative observations for this episode
self.episode_observations, self.episode_actions, self.episode_rewards = [], [], []
self.model_policy = None
self.model_baseline = None
#Build the keras network
self._build_network()
def observe(self, state, action, reward):
""" This function takes the observations the agent received from the environment and stores them
in the lists above. If necessary, preprocess the action here for the network. You may also get
better results clipping or normalizing the reward to limit its range for training."""
raise NotImplementedError
def decide(self, state):
""" This function feeds the observed state to the network, which returns a distribution
over possible actions. Sample an action from the distribution and return it."""
raise NotImplementedError
def train(self):
""" When this function is called, the accumulated observations, actions and discounted rewards from the
current episode should be fed into the network and used for training. Use the _get_returns function
to first turn the episode rewards into discounted returns. """
raise NotImplementedError
def _get_returns(self):
""" This function should process self.episode_rewards and return the discounted episode returns
at each step in the episode, then optionally apply a baseline. Hint: work backwards."""
raise NotImplementedError
def _build_network(self):
""" This function should build the network that can then be called by decide and train.
The network takes observations as inputs and has a policy distribution as output."""
raise NotImplementedError
def update_model(self, model_policy, model_baseline = None):
self.model_policy = model_policy
print('Update model:', model_policy, model_baseline)
if model_baseline:
print('Loaded model baseline')
self.model_baseline = model_baseline
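The "work backwards" hint for _get_returns amounts to a single reverse pass over the rewards. A hypothetical numpy helper (the name and signature are illustrative, not part of the skeleton):

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """G_t = sum_{k>=t} gamma^(k-t) * r_k, computed in one backward pass."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running  # G_t = r_t + gamma * G_{t+1}
        returns[t] = running
    return returns
```

This is O(n) per episode, versus O(n^2) for recomputing the sum at every step.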
Implement the REINFORCE Policy Gradient algorithm using a deep neural network as a function approximator.
WARNING: Running any experiments with the same names (first argument in run_experiment) will cause your results to be overwritten.
Mark breakdown: 5 points total
class RLAgent_Ex1(RLAgent):
def __init__(self, n_obs, action_space, learning_rate, discount, baseline = None):
super(RLAgent_Ex1, self).__init__(n_obs, action_space, learning_rate, discount, baseline)
self.epsilon = 0.1
print('RLAgent 1')
def observe(self, state, action, reward):
""" This function takes the observations the agent received from the environment and stores them
in the lists above. If necessary, preprocess the action here for the network. You may also get
better results clipping or normalizing the reward to limit its range for training."""
self.episode_observations.append(state)
self.episode_actions.append(action)
self.episode_rewards.append(reward)
def decide(self, state):
""" This function feeds the observed state to the network, which returns a distribution
over possible actions. Sample an action from the distribution and return it."""
        state = np.expand_dims(state, axis=0)
        action_probs = self.model_policy.predict(state)[0]
        # Sample from the policy distribution instead of acting greedily
        return np.random.choice(self.n_act, p=action_probs)
def train(self):
""" When this function is called, the accumulated observations, actions and discounted rewards from the
current episode should be fed into the network and used for training. Use the _get_returns function
to first turn the episode rewards into discounted returns. """
episode_steps = len(self.episode_observations)
num_actions = 4
inputs = np.asarray(self.episode_observations)
targets = np.zeros((episode_steps, 1))
moving_avarage_value, moving_avarage_index = [], 0
print((self.episode_rewards[-3:]), sum(self.episode_rewards))
print('actions:', self.episode_actions)
for t in range(episode_steps):
G = 0 # discounted returns
for k in range(t, episode_steps):
G += pow(self.gamma, k - t) * self.episode_rewards[k]
if self.use_adaptive_baseline:
state_reshaped = np.expand_dims(self.episode_observations[t], axis=0)
G_reshaped = np.expand_dims(G, axis=0)
_, _ = self.model_baseline.train_on_batch(state_reshaped, G_reshaped)
adaptive_baseline = self.model_baseline.predict(state_reshaped)
#print('adaptive baseline:', G, adaptive_baseline, G - adaptive_baseline)
G = G - adaptive_baseline
elif self.use_baseline:
avg_period = 20
if moving_avarage_index < avg_period:
moving_avarage_index += 1
moving_avarage_value.append(G)
else:
moving_avarage_value.pop(0)
moving_avarage_value.append(G)
#print('before baseline:', G,(sum(moving_avarage_value)) / moving_avarage_index)
G = G - (sum(moving_avarage_value)) / moving_avarage_index
targets[t] = pow(self.gamma, t) * G #, int(self.episode_actions[t])
# print(inputs)
# print(targets)
loss, _ = self.model_policy.train_on_batch(inputs, targets)
# print(loss)
        #These lists store the cumulative observations for this episode
self.episode_observations, self.episode_actions, self.episode_rewards = [], [], []
def _get_returns(self):
""" This function should process self.episode_rewards and return the discounted episode returns
at each step in the episode, then optionally apply a baseline. Hint: work backwards."""
def _build_network(self):
""" This function should build the network that can then be called by decide and train.
The network takes observations as inputs and has a policy distribution as output """
print(self.use_baseline, self.use_adaptive_baseline)
optimizer_adam = Adam(lr= self.lr)
model = Sequential()
model.add(Dense(128, activation='relu', input_dim=8))
model.add(Dropout(0.4))
model.add(Dense(64, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(12, activation='relu'))
model.add(Dense(4, activation='softmax'))
model.compile(optimizer=optimizer_adam,
loss=REINFORCE, metrics=['acc'])
self.model_policy = model
if self.use_adaptive_baseline:
optimizer_adam = Adam(lr=self.lr)
model = Sequential()
model.add(Dense(32, activation='relu', input_dim=8))
model.add(Dense(20, activation='relu'))
model.add(Dense(10, activation='relu'))
model.add(Dense(1))
model.compile(optimizer=optimizer_adam,
loss='MSE',
metrics=['accuracy'])
self.model_baseline = model
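The 'simple' baseline above is a moving average of recent returns. The bookkeeping can be isolated in a small class; this is a sketch using an assumed deque-based buffer, not the code the agent actually runs:

```python
from collections import deque

class MovingBaseline:
    """Subtract the mean of the last `period` returns from each new return."""
    def __init__(self, period=20):
        self.buffer = deque(maxlen=period)  # old returns fall out automatically
    def __call__(self, G):
        self.buffer.append(G)
        return G - sum(self.buffer) / len(self.buffer)
```

Subtracting a baseline leaves the policy gradient unbiased but reduces its variance.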
def REINFORCE(target, output):
    # target[:, 0]: the discounted return at each step
    # NOTE: this uses the probability of the *most likely* action; proper REINFORCE
    # would use the probability of the action actually taken at each step.
    reduced = tf.reduce_max(output, axis=-1)
    target = tf.reduce_max(target, axis=1)
    a = tf.multiply(target, tf.log(reduced))
    a = tf.reduce_mean(a, axis=0)
    # Negate: Keras minimizes the loss, while we want to maximize E[G log pi]
    return -a
# Sum up losses instead of mean
def categorical_crossentropy(target, output):
    _epsilon = tf.convert_to_tensor(1e-7, dtype=output.dtype.base_dtype)
    output = tf.clip_by_value(output, _epsilon, 1. - _epsilon)
    return tf.reduce_sum(-tf.reduce_sum(target * tf.log(output), axis=len(output.get_shape()) - 1), axis=-1)
def softmax(x):
    """Compute softmax values for each set of scores in x."""
    e_x = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e_x / e_x.sum(axis=0)
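Given a probability vector like a softmax policy output, sampling a discrete action is one line. A standalone numpy sketch (the probabilities here are arbitrary example values):

```python
import numpy as np

np.random.seed(0)
probs = np.array([0.1, 0.2, 0.3, 0.4])          # e.g. a softmax policy output
action = np.random.choice(len(probs), p=probs)  # index sampled in proportion to probs
```

This is the sampling step that `decide` should perform, as opposed to taking the argmax, which would make the policy deterministic and kill exploration.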
learning_rate = 0.001
#Supply a filename here to load results from disk
results = Results()
start_time = time.time()
name = 'REINFORCE'
#agent, rewards = get_agent_reward(name)
episodes, recurrent_agent, rewards= run_experiment(RLAgent_Ex1, name, discrete_env, 10, learning_rate,
#old_params=(agent, rewards),
graph=True)
print('saving models..')
#save_agent_reward(recurrent_agent, rewards, name)
print('time:', datetime.timedelta(seconds=time.time() - start_time))
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot
SVG(model_to_dot(recurrent_agent.model_policy).create(prog='dot', format='svg'))
with tf.Session() as sess:
writer = tf.summary.FileWriter('logs', sess.graph)
writer.close()
render(episodes[-1], discrete_env)
name = "REINFORCE (with baseline)"
#agent, rewards = get_agent_reward(name)
episodes, recurrent_agent, rewards= run_experiment(RLAgent_Ex1, name, discrete_env, 2, learning_rate=0.001,
baseline='simple',
#old_params=(agent, rewards),
graph=True)
print('saving models..')
#save_agent_reward(recurrent_agent, rewards, name)
print(rewards[-1])
render(episodes[-1], discrete_env)
Add a second neural network to your model that learns an observations-dependent adaptive baseline and subtracts it from your discounted returns, to reduce variance in learning.
TECHNICAL NOTE: Some textbooks may refer to this approach as "Actor-Critic", where the policy network is the "Actor" and the value network is the "Critic". Sutton and Barto (2018) suggest that Actor-Critic only applies when the discounted returns are bootstrapped from the value network output, as you saw in class. This can introduce instability in learning that needs to be addressed with more advanced techniques, so we won't use it for this miniproject. You can read more about state-of-the-art Actor-Critic approaches here: https://arxiv.org/pdf/1602.01783.pdf
Mark breakdown: 2 points total
name = "REINFORCE (adaptive baseline)"
#agent, rewards = get_agent_reward(name)
episodes, adaptive_agent, rewards= run_experiment(RLAgent_Ex1, name, discrete_env, 10, learning_rate,
baseline='adaptive',
#old_params=(agent, rewards),
graph=True)
print('saving models..')
#save_agent_reward(adaptive_agent, rewards, name)
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot
SVG(model_to_dot(adaptive_agent.model_baseline).create(prog='dot', format='svg'))
render(episodes[-1], discrete_env)
Ideally, our value network should have learned to predict the relative values across the input space. We can test this by plotting the value prediction for different observations.
Mark breakdown: 3 points total
def grid_creation(x_density, y_density):
xs1 = np.linspace(-1, 1, num=x_density)
xs2 = np.linspace(-0.2, 1, num=y_density)
xx, yy = np.meshgrid(xs1, xs2) # create the grid
ex = np.zeros((len(xx) * len(xx[0]), 2))
print(ex.shape)
for j in range(y_density):
for i in range(x_density):
ex[y_density * i + j, 0] = xx[j, i]
ex[y_density * i + j, 1] = yy[j, i]
return xx, yy, ex
def fill_input_tensor(grid_points, X, Y, wL, wV, pad0, pad1):
inputs = np.zeros((grid_points.shape[0], 8))
inputs[:, 0] = grid_points[:,0]
inputs[:, 1] = grid_points[:,1]
inputs[:, 2] = X
inputs[:, 3] = Y
inputs[:, 4] = wL
inputs[:, 5] = wV
inputs[:, 6] = pad0
inputs[:, 7] = pad1
return inputs
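Note the indexing convention grid_creation relies on: np.meshgrid returns arrays indexed as [y, x], i.e. with shape (len(ys), len(xs)):

```python
import numpy as np

xx, yy = np.meshgrid(np.linspace(-1, 1, 3), np.linspace(-0.2, 1, 2))
# xx and yy both have shape (2, 3): rows follow y, columns follow x
```

This is why grid_creation indexes `xx[j, i]` with `j` over y_density and `i` over x_density, and why the predictions are reshaped to (y_density, x_density) before contourf.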
baseline_net = adaptive_agent.model_baseline
x_density, y_density = 200, 120
xx, yy, grid_points = grid_creation(x_density, y_density)
for t in range(4):
if t == 0:
X, Y, wL, wV, pad0, pad1 = 0, 0, 0, 0, 0, 0
else:
X = np.random.uniform(-1, 1)
Y = np.random.uniform(-0.2, 1)
wL, wV = np.random.uniform(-np.pi, np.pi), np.random.uniform(-np.pi, np.pi)
inputs = fill_input_tensor(grid_points, X, Y, wL, wV, pad0, pad1)
predictions = baseline_net.predict(inputs)
predictions = np.squeeze(predictions)
classification_plane = predictions.reshape((y_density, x_density))
plt.figure(t)
plt.contourf(xx, yy, classification_plane, cmap=plt.cm.jet)
plt.colorbar()
plt.title('X:' + str(X) + ' Y:' + str(Y) + ' wL:' + str(wL) + ' wV:' + str(wV) + ' pad0:' + str(pad0) + ' pad1:' + str(pad1))
Question: Is there a combination of variables in the ranges above for which you see the highest rewards? Do they make sense?
Answer:
Question: What about outside of the ranges above? Why might these produce higher values?
Answer:
Question: Are the values higher before or after the legs touch the surface? Why?
Answer:
One disadvantage of Q-learning-type approaches is that they require that the agent take discrete actions ("left", "right", "up", "down" etc.). In policy gradient, the agent learns a distribution over actions for each observation. That distribution can be either discrete (as we saw above) or continuous.
Here we will switch to continuous actions. The agent has a 2D action at each time step: a value in [-1,1] to control the amount of thrust, and a value in [-1,1] to control the left/right direction of the thrust. Since the output is bounded, we will model it with a Beta distribution: http://en.wikipedia.org/wiki/Beta_distribution.
A Beta distribution is defined by two parameters: alpha and beta. The network should output both for each action. We will ensure that alpha >= 1 and beta >= 1, which keeps the distribution unimodal and well-behaved. The agent then samples from a distribution defined by [alpha, beta] for each action and transforms the [0,1] output to [-1,1] to act.
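Sampling and rescaling for a single action can be sketched in numpy (alpha and beta here are arbitrary example values, both >= 1):

```python
import numpy as np

np.random.seed(0)
alpha, beta = 2.0, 5.0                 # example distribution parameters
sample = np.random.beta(alpha, beta)   # lies in [0, 1]
action = 2.0 * sample - 1.0            # rescaled to [-1, 1]
```

The inverse transform, (action + 1) / 2, maps the stored actions back to [0, 1] before evaluating their Beta log-probability during training.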
Modify your model in the following ways when it detects that self.continuous is True:
Mark breakdown: 5 points total
class RLAgent_Ex4(RLAgent):
def __init__(self, n_obs, action_space, learning_rate, discount, baseline = None):
super(RLAgent_Ex4, self).__init__(n_obs, action_space, learning_rate, discount, baseline)
self.epsilon = 0.1
print('RLAgent 4')
def observe(self, state, action, reward):
""" This function takes the observations the agent received from the environment and stores them
in the lists above. If necessary, preprocess the action here for the network. You may also get
better results clipping or normalizing the reward to limit its range for training."""
self.episode_observations.append(state)
self.episode_actions.append(action)
self.episode_rewards.append(reward)
def decide(self, state):
""" This function feeds the observed state to the network, which returns a distribution
over possible actions. Sample an action from the distribution and return it."""
state = np.expand_dims(state, axis=0)
output = self.model_policy.predict(state)[0]
action1_value = np.random.beta(output[0], output[1])* 2 - 1
action2_value = np.random.beta(output[2], output[3])* 2 - 1
return np.array([action1_value, action2_value])
def train(self):
""" When this function is called, the accumulated observations, actions and discounted rewards from the
current episode should be fed into the network and used for training. Use the _get_returns function
to first turn the episode rewards into discounted returns. """
episode_steps = len(self.episode_observations)
num_actions = 2
inputs = np.asarray(self.episode_observations)
targets = np.zeros((episode_steps, num_actions+1))
moving_avarage_value, moving_avarage_index = [], 0
#print(sum(self.episode_rewards))
for t in range(episode_steps):
G = 0 # discounted returns
            for k in range(t, episode_steps):
                G += pow(self.gamma, k - t) * self.episode_rewards[k]
if self.use_adaptive_baseline:
state_reshaped = np.expand_dims(self.episode_observations[t], axis=0)
G_reshaped = np.expand_dims(G, axis=0)
_, _ = self.model_baseline.train_on_batch(state_reshaped, G_reshaped)
adaptive_baseline = self.model_baseline.predict(state_reshaped)
#print('adaptive baseline:', G, adaptive_baseline, G - adaptive_baseline)
G = G - adaptive_baseline
elif self.use_baseline:
avg_period = 20
if moving_avarage_index < avg_period:
moving_avarage_index += 1
moving_avarage_value.append(G)
else:
moving_avarage_value.pop(0)
moving_avarage_value.append(G)
#print('before baseline:', G,(sum(moving_avarage_value)) / moving_avarage_index)
G = G - (sum(moving_avarage_value)) / moving_avarage_index
            #print(G)
targets[t] = (self.episode_actions[t][0]+1)/2, (self.episode_actions[t][1]+1)/2, (pow(self.gamma, t) * G)
loss = self.model_policy.train_on_batch(inputs, targets)
print('loss:', loss)
        #These lists store the cumulative observations for this episode
self.episode_observations, self.episode_actions, self.episode_rewards = [], [], []
def _get_returns(self):
""" This function should process self.episode_rewards and return the discounted episode returns
at each step in the episode, then optionally apply a baseline. Hint: work backwards."""
def _build_network(self):
""" This function should build the network that can then be called by decide and train.
The network takes observations as inputs and has a policy distribution as output """
print(self.use_baseline, self.use_adaptive_baseline)
        optimizer_adam = Adam(lr=self.lr)
        state_input = Input(shape=(8,))
        h1 = Dense(24, activation='relu')(state_input)
        h2 = Dense(48, activation='relu')(h1)
        h3 = Dense(24, activation='relu')(h2)
        # Softplus keeps the outputs positive; adding 1 enforces alpha, beta >= 1
        output = Dense(4, activation='softplus')(h3)
        final_output = Lambda(lambda x: x + 1, output_shape=(4,))(output)
        model = Model(input=state_input, output=final_output)
        model.compile(loss=beta_loss, optimizer=optimizer_adam)
        self.model_policy = model
if self.use_adaptive_baseline:
optimizer_adam = Adam(lr=self.lr)
model = Sequential()
model.add(Dense(32, activation='relu', input_dim=8))
model.add(Dense(20, activation='relu'))
model.add(Dense(10, activation='relu'))
model.add(Dense(1))
model.compile(optimizer=optimizer_adam,
loss='MSE',
metrics=['accuracy'])
self.model_baseline = model
def beta_loss(target, output):
    # target[:, 0:2]: the actions taken, rescaled to [0, 1]; target[:, 2]: the discounted return
    # output holds two (alpha, beta) pairs; compute the log-probability of each action taken
    action1_prob = Beta(output[:, 0], output[:, 1])
    action2_prob = Beta(output[:, 2], output[:, 3])
    a = action1_prob.log_prob(target[:, 0])
    b = action2_prob.log_prob(target[:, 1])
    # Negate: Keras minimizes the loss, while we want to maximize E[G log pi]
    return -tf.reduce_sum((a + b) * target[:, 2], axis=-1)
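The Beta.log_prob call above corresponds to a simple closed form. A stdlib-only check (beta_log_prob is an illustrative helper, not used by the agent):

```python
import math

def beta_log_prob(x, alpha, beta):
    """Log of the Beta(alpha, beta) density at x in (0, 1)."""
    # log B(a, b) = lgamma(a) + lgamma(b) - lgamma(a + b)
    log_B = math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)
    return (alpha - 1) * math.log(x) + (beta - 1) * math.log(1 - x) - log_B
```

For instance Beta(1, 1) is the uniform distribution, so its log-density is 0 everywhere on (0, 1).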
learning_rates = [0.001]
c_models = []
results.plot_keys = []
for lr in learning_rates:
experiment_name = ("Continuous REINFORCE (learning rate: %s)" % str(lr))
results.plot_keys.append(experiment_name)
episodes, model, rewards = run_experiment(RLAgent_Ex4, experiment_name, continuous_env, 49, lr, baseline='simple')
c_models.append(model)
render(episodes[-1], continuous_env)
The code you've written above can be easily adapted for other environments in Gym. If you like, try playing around with different environments and network structures!